Language Identification Investigation
| Author | Aaron Binns |
|---|---|
| Date | January 10, 2011 |
Summary
Evaluation of a Java library used to identify the human language of archival web page content, with analysis of test results and thoughts on problems and challenges.
Introduction
A very common request from web archive users is to identify the human language in which a web page is written. Individual end-users may wish to know in which language a page is written. Researchers may want to study only pages in a particular language. Web archivists may wish to harvest only pages written in particular languages, using language as a factor in determining whether a page is inside or outside a domain harvest's parameters.
Although HTML for many years has supplied mechanisms for declaring the language of the page content, they have not been widely used and cannot be relied upon for large-scale language identification. However, if language metadata is provided, it makes sense to utilize it in some capacity.
Language vs. Encoding
A related and popular topic is "character encoding detection", which is not the same as language identification. Character encoding is the encoding and decoding of abstract characters to and from bytes. Language identification is taking those character sequences and determining to which human language they belong.
In this investigation, we kept the encoding issues separate and assumed that bytes have been properly decoded into characters. The language identification system then operates on characters only.
In some cases, the encoding can be used to guide the language identification. Encodings like UTF-8 can encode all Unicode characters, but others, such as KOI8-R, handle only ASCII and Cyrillic; KOI8-R was originally intended for use with Russian and related languages. So, if we encounter a web page using KOI8-R, we could use that as a hint to the language identification system.
Identification vs. Detection
As a matter of terminology, it seems that "identification" and "detection" are used interchangeably. I prefer to use "identification" to avoid confusion with "encoding detection" described in the previous section. However, some of the open source projects examined in this report use "language detection" in the project name, which I will use when referring to those projects.
Implementation Survey
A few simple web searches for "language identification" or "language detection" reveal a number of implementations, both commercial and open source.
One of the more popular commercial solutions is from Basis Technology, who provide an entire suite of language processing and analysis tools.
In the open source world, it seems that TextCat pioneered the application of a statistical n-gram approach as described in Cavnar, W. B. and J. M. Trenkle, N-Gram-Based Text Categorization, Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994. TextCat spawned a number of derivative implementations in various programming languages.
Language Detection Library for Java
I chose to evaluate the Language Detection Library for Java by Shuyo Nakatani.
The full details of the algorithm are described in the presentation linked on the project homepage.
To summarize, the blob of text is split into words, and the words are split into characters. Sub-sequences of characters are then matched against a library of subsequences with associated probabilities that the subsequence belongs to a particular language.
For example, "ion" and "thr" are common three-letter sequences in English; "äch" and "ünd" in German; and "şte" and "ată" in Romanian. The actual language profiles are much more elaborate and assign probabilities to each character sequence, so that even if a sequence is common in more than one language, the languages can still be distinguished when given a non-trivial input sequence.
The basic algorithm is to take the 2- or 3-character sequences of each word, find the probability that the sequence is in each language and combine the probabilities as the text is analyzed. In the end, zero or more languages are identified with a confidence score for each.
In many cases, a single language is identified with a high confidence, e.g. 99%. In some cases, multiple languages are identified, with decreasing confidence scores. And sometimes, the algorithm fails to identify any language at all.
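The idea of combining per-sequence probabilities can be illustrated with a minimal sketch. This is not the langdetect implementation: the class name NgramSketch, the two toy profiles, their probabilities, and the UNSEEN smoothing value are all made up for illustration, and the combination rule is a plain sum of log-probabilities normalized at the end. The non-ASCII trigrams are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class NgramSketch {
    // Toy trigram profiles with invented probabilities; real profiles are far larger.
    // The two German keys are "a-umlaut c h" and "u-umlaut n d".
    static final Map<String, Map<String, Double>> PROFILES = Map.of(
        "en", Map.of("ion", 0.05, "thr", 0.03, "the", 0.08),
        "de", Map.of("\u00e4ch", 0.04, "\u00fcnd", 0.02, "sch", 0.08, "the", 0.01));

    // Assumed probability for a trigram that a profile has never seen.
    static final double UNSEEN = 1e-4;

    /** Returns language -> confidence in [0, 1]; empty if the text has no trigrams. */
    static Map<String, Double> score(String text) {
        Map<String, Double> logScore = new TreeMap<>();
        int grams = 0;
        // Split into words on anything that is not a letter, then slide a 3-char window.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            for (int i = 0; i + 3 <= word.length(); i++) {
                String gram = word.substring(i, i + 3);
                grams++;
                for (Map.Entry<String, Map<String, Double>> p : PROFILES.entrySet()) {
                    double prob = p.getValue().getOrDefault(gram, UNSEEN);
                    logScore.merge(p.getKey(), Math.log(prob), Double::sum);
                }
            }
        }
        if (grams == 0) {
            return logScore; // nothing to analyze, so no language is identified
        }
        // Normalize the log scores into confidences that sum to 1.
        double max = Collections.max(logScore.values());
        double total = 0;
        for (double v : logScore.values()) {
            total += Math.exp(v - max);
        }
        for (Map.Entry<String, Double> e : logScore.entrySet()) {
            e.setValue(Math.exp(e.getValue() - max) / total);
        }
        return logScore;
    }
}
```

Even though "the" appears in both toy profiles, a phrase such as "the nation threw" still scores overwhelmingly as English because the other trigrams tip the balance, which is the effect described above.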
Code
The code can be downloaded from the Google Code project page:
You need two things:
- langdetect.jar — the language detection library
- profiles/ — the language profiles in JSON format
A simple command-line driver is provided:
Since our main concern is to identify the language of web content, the simple command-line driver isn't directly applicable. What we want is to pass the text of a web page to the language identification library and get back a list of language+confidence scores.
Since my pet project The JBs already provides the framework for running a Hadoop MapReduce job on NutchWAX segments (which contain the extracted webpage text), I hacked together some code to run the language identification inside a MapReduce job.
Integration
On the whole, integrating the langdetect library into
The JBs and MapReduce was fairly straightforward. I chose to
perform the language identification during the map stage, emitting
the web page unique key (URL+hash) along with the
JSON-format string listing the language+confidence pairs.
In simplified form, removing some exception handling and other syntactic sugar, the code looks like:
The 0.5 value is taken from the project's example code.
There were two problems integrating the langdetect
library into the Hadoop framework. Both had to do with assumptions
in the library which were valid in a stand-alone program but failed
in the Hadoop distributed environment.
Static Initialization
The first problem had to do with the fact that
the langdetect library required one-time
initialization. In a stand-alone program running in a single Java
VM, this is easily achieved in many ways: static initializer, in the
main() method, etc.
However, when running in Hadoop, one has to handle both the local and distributed contexts. If Hadoop is running on a local machine, then a single JVM is used; whereas in distributed mode multiple JVMs are spawned across many nodes for the various Map and Reduce tasks. In all cases the library must be initialized for each JVM.
To complicate matters, the library prohibits multiple initializations. If the initialization routine is called more than one time in a single JVM, the library throws an exception.
Ultimately, these problems were all solved by putting the
initialization code into the Mapper
and Reducer constructors then catching and ignoring any
exceptions thrown due to multiple initialization. This handles both
the local single-node-single-JVM case as well as the distributed
environment.
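The constructor-based workaround can be sketched roughly as follows. To keep the example self-contained, StrictLibrary is a hypothetical stand-in for a library that refuses a second initialization, and LanguageMapper is a plain class rather than a real Hadoop Mapper; neither name comes from langdetect or Hadoop.

```java
final class InitOnceSketch {

    /** Hypothetical stand-in for a library that may be initialized only once per JVM. */
    static final class StrictLibrary {
        private static boolean loaded = false;

        static synchronized void load(String profileDir) {
            if (loaded) {
                throw new IllegalStateException("profiles already loaded");
            }
            loaded = true; // a real library would read the profiles here
        }
    }

    /** Hypothetical mapper: initializes the library in its constructor. */
    static final class LanguageMapper {
        /** True if this instance performed the JVM's one-time initialization. */
        final boolean initializedHere;

        LanguageMapper(String profileDir) {
            boolean first;
            try {
                StrictLibrary.load(profileDir);
                first = true;
            } catch (IllegalStateException alreadyLoaded) {
                // Another task in this JVM got there first; safe to ignore.
                first = false;
            }
            initializedHere = first;
        }
    }
}
```

Catching only the exception that signals repeated initialization, rather than every exception, keeps genuine failures such as a missing profile directory visible.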
Directory of Profiles
The other problem only occurred in the Hadoop distributed
environment, and is nearly the same as a problem I already
encountered in NutchWAX. The langdetect library wants
to open its profiles/ directory as an on-disk directory
and iterate through the files contained therein.
This works just fine in the local single-node-single-JVM mode
because the jar file containing our Hadoop MapReduce
code is expanded in a directory in ${HADOOP_TMP}
with the JVM's current working directory set to it. Thus, the
langdetect code can open the
on-disk profiles/ directory.
However, in Hadoop 0.20 distributed mode, the jar file
is sent over the wire to the TaskTrackers to run the MapReduce code,
but it is not expanded into a temporary directory.
When the langdetect library tries to open the
directory, it fails.
I ran into essentially the same problem in NutchWAX where the code wanted to iterate through a directory listing of plugins. The work-around contains two parts:
- Set the Hadoop configuration property mapreduce.job.jar.unpack.pattern to true to force Hadoop to expand the jar file in a temp directory.
- Custom code to find the directory relative to the location of the unpacked jar file, which is in an unpredictable subdir under ${HADOOP_TMP}.
Nasty little hack, but it does the job.
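One way to locate a directory relative to wherever the code was unpacked is sketched below. It is an illustration of the general technique, not the code actually used in The JBs or NutchWAX: the class ProfileLocator and its method names are hypothetical, and it assumes the classes were loaded from a file-system location (a jar file or a directory of unpacked classes).

```java
import java.io.File;
import java.net.URISyntaxException;
import java.security.CodeSource;

final class ProfileLocator {

    /** Returns the jar file or class directory that the given class was loaded from. */
    static File codeLocation(Class<?> anchor) throws URISyntaxException {
        CodeSource source = anchor.getProtectionDomain().getCodeSource();
        if (source == null) {
            throw new IllegalStateException("no code source for " + anchor);
        }
        return new File(source.getLocation().toURI());
    }

    /**
     * Resolves a directory name against the code location: inside it when the
     * code was unpacked into a directory, or beside it when it is a jar file.
     */
    static File resolveBeside(File codeLocation, String dirName) {
        File base = codeLocation.isDirectory() ? codeLocation : codeLocation.getParentFile();
        return new File(base, dirName);
    }
}
```

A mapper could then call something like resolveBeside(codeLocation(MyMapper.class), "profiles") and hand the resulting path to the library, where MyMapper is whatever class lives in the job jar.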
Tests
I ran tests on sample web content from three different collections with different language profiles. The results of each are outlined below.
U.S. Congressional Websites
The first test was on a small set of WARC files from a Library of Congress collection. I keep these on my local machine for quick testing of new ideas and experimental code. They are primarily pages from U.S. congressional websites, such as senators' homepages. I expect that they will all (or nearly all) be in English.
The langdetect library identified most as English, and
although I did not examine every result, I assume they are correct.
It also detected a small number of pages in Spanish, which was also
correct as many U.S. senators and representatives have pages
targeted to their Spanish-speaking constituents.
A small number of pages were (mis)identified as a seemingly random language; but upon further examination it turned out that these pages almost always contained primarily computer source code, such as JavaScript or CSS. An obvious improvement to my test code is to exclude such pages from analysis.
Biblioteca Nacional de España
I then ran a test of a large sample of web pages from the 2010 domain harvests IA performed for Biblioteca Nacional de España. I expected a large portion of the pages to be identified as Spanish.
Overall, the aggregate results confirmed my expectations. Most of the pages were identified as Spanish, with English being the second most frequent. It also identified many pages as Catalan; the few I examined appeared to be correctly identified.
What was strange was the large number of pages identified as being in languages which are rather rare on the web, such as Guaraní.
Library of Congress Iraq War Collection
A test on a 5000-arcfile sample of the LoC Iraq War Collection exhibited similar aggregate meta-results to the BNE collection. The top 10 are as follows:
| ISO 639-1 | Language | # Pages |
|---|---|---|
| ar | Arabic | 1,214,679 |
| az | Azerbaijani | 252,734 |
| en | English | 135,899 |
| gn | Guaraní | 34,814 |
| da | Danish | 30,167 |
| my | Burmese | 11,840 |
| fo | Faroese | 8,880 |
| is | Icelandic | 4,931 |
| fr | French | 2,327 |
| hr | Croatian | 1,117 |
Although, as one would expect, Arabic and English are in the top three, it's rather strange that Azerbaijani, Guaraní and Danish would be the others in the top five.
Looking at pages identified as Azerbaijani, it seems that the pages
in the Iraq War collection trigger a common failure-mode in the use
of the langdetect library: pages with multiple
languages.
I examined a dozen or so pages identified as Azerbaijani and they all seemed to have a mixture of Arabic and English content. One illustrative example was a homepage on blogger.com where the page template was in English, but the blogger's name, description and so forth were in Arabic.
I suspect that when given a page with multiple languages as a single
input to the langdetect library, it will often
mis-identify it as an otherwise rarely occurring language.
Analysis
As mentioned above, analyzing the test results immediately reveals two issues:
- Program source code: Unstripped HTML, JavaScript, CSS, etc.
- Multiple languages on a single page
The first is easily addressed by looking at the MIME type of the page, or perhaps even the filename. In general we omit these types when performing language analysis on web content, and only did not do so here due to programmer haste to get the test code running.
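A filter of this kind might look like the sketch below. The allow-list of MIME types is an assumption chosen for illustration, not the list used in our processing system, and the class name MimeFilter is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class MimeFilter {
    // Assumed allow-list: only these types are passed on to language identification.
    private static final Set<String> TEXTUAL = new HashSet<>(Arrays.asList(
        "text/html", "text/plain", "application/xhtml+xml"));

    /** Returns true if a page with this Content-Type value should be analyzed. */
    static boolean shouldAnalyze(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
            .trim()
            .toLowerCase(Locale.ROOT);
        return TEXTUAL.contains(base);
    }
}
```

Using an allow-list rather than a block-list means JavaScript, CSS, and any other unexpected type are skipped by default.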
Multiple Languages
A much more complex issue is multiple languages on a single page. It immediately calls into question the very notion of assigning a single "language" to a page. Consider the blogger.com example from the previous section. Should that page be identified as solely Arabic or solely English? Wouldn't it be more useful to state that it contains content in both languages?
We should rephrase the original question, not asking "Which language is the page?" but rather "Which languages are used on this page?"
Ideally, we would use the HTML markup structure to separate the page into chunks and analyze each one. In our current processing system, however, the page's plain text is extracted and all the HTML markup is discarded. This is something that we could revisit.
Multiple scripts
When analyzing the results from the Iraq War collection test, I was puzzled by the large number of pages identified as Azerbaijani. Then I discovered that Azerbaijani can be written in both Arabic and Latin scripts. For example:
| Script | Text |
|---|---|
| Arabic | آذربایجان دا انسان حاقلاری ائوی آچیلاجاق ب م ت ائلچيسي برمه موخاليفتي نين ليدئري ايله گؤروشه بيليب ترس شوونيسم فارس از آزادي ملتهاي تورکمن |
| Latin | a az qalıb breyn rinq intellektual oyunu üzrə yarışın zona mərhələləri keçirilib miq un qalıqlarının dənizdən çıxarılması davam edir məhəmməd peyğəmbərin karikaturalarını çap edən qəzetin baş redaktoru iş otağında ölüb |
Reconsidering the statistical n-gram algorithm, I hypothesize that on a page of mixed Arabic and English, the Arabic part matches the Arabic-script Azerbaijani n-grams and the English part matches the Latin-script Azerbaijani n-grams. And since Arabic won't match the English parts, and English won't match the Arabic parts, Azerbaijani is the one that matches both.
Azerbaijani is not the only language which utilizes more than one script; there are many. For example Bosnian, Serbian, Turkmen, Tatar, Uyghur, and Uzbek all use both Latin and Cyrillic scripts; Kazakh and Kyrgyz use all three: Latin, Cyrillic and Arabic.
Since scripts are easily distinguished in Unicode, an obvious tactic to try is to first separate the words by script, then analyze each one separately. So for the Iraq War pages with both Arabic and English, we would separate the Arabic parts from the Latin parts and analyze each in turn. This would produce a multi-valued result — that the page has both Arabic and English — rather than forcing the library to make a single, erroneous choice for the entire page.
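A minimal sketch of the separation step follows. It assumes Java 7 or later, where Character.UnicodeScript is available, and it classifies each whitespace-separated word by the script of its first letter, which is a simplification chosen for brevity. The class name ScriptSplitter is hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

final class ScriptSplitter {

    /** Groups the words of the text by script, preserving word order within each group. */
    static Map<UnicodeScript, String> splitByScript(String text) {
        Map<UnicodeScript, StringBuilder> parts = new EnumMap<>(UnicodeScript.class);
        for (String word : text.split("\\s+")) {
            UnicodeScript script = scriptOf(word);
            if (script == null) {
                continue; // no letters at all, e.g. numbers or punctuation
            }
            parts.computeIfAbsent(script, s -> new StringBuilder()).append(word).append(' ');
        }
        Map<UnicodeScript, String> out = new EnumMap<>(UnicodeScript.class);
        parts.forEach((s, sb) -> out.put(s, sb.toString().trim()));
        return out;
    }

    /** Returns the script of the first letter in the word, or null if it has none. */
    static UnicodeScript scriptOf(String word) {
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            if (Character.isLetter(cp)) {
                return UnicodeScript.of(cp);
            }
            i += Character.charCount(cp);
        }
        return null;
    }
}
```

Each value in the returned map could then be passed to the language identification library as a separate input, yielding one result per script rather than one per page.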
Chromium
After pondering the test results, I took a look at Chromium, the open source project behind the Google Chrome web browser. Chrome has a built-in language identification feature and I was curious whether it was in the open source project. Fortunately, it is:
contains the "compact language detection library".
Reading through the code, it appears that they use a similar algorithm based on statistical n-grams with profiles for each language. They also incorporate many of the speculations in the previous sections as well as other enhancements:
- Segregate text by script: Arabic, Cyrillic, Latin, CJK.
- Group English, French, Italian, German and Spanish (E-FIGS) together at first, then re-analyze within these languages.
- Add "hints" to scoring formula which can boost languages based on page metadata, top-level domain, etc.
- Ignore sections of text that look like page navigation, or are not dense enough.
Conclusions
As outlined in the sections above, a few tentative conclusions can be drawn:
- For non-trivial blocks of text in a single language, the langdetect library works well.
- Program source code triggers misidentification.
- Segregating text in different scripts helps avoid misidentification.
- A number of hypothesized techniques are implemented in Chromium and should be considered.
One thing is for sure: language identification is not simple and is not "fire and forget". Achieving a language identification capability suitable for web archives will require further experimentation and refinement.
Addendum: Java 6 Unicode Support
I recently came across this note on StackOverflow.com, which points out that Java 6 only supports Unicode 4.0, whereas the current (June 2011) version of Unicode is 6.0:
Obtaining unicode characters of a language in Java
Although it probably doesn't make much difference to people working with European languages, I believe post-4.0 Unicode greatly enhances support for Asian languages, bi-directional handling, and lots of other errata. The aspect of most relevance to this paper is the use of Unicode script properties.
In Java, you can use Unicode categories and blocks to easily identify character properties. For example, the category "Lu" is all upper-case letters in all languages; and "InGreek" is the block of Greek characters. These properties are accessible in regular expressions too, allowing a regex to match all upper-case letters with the pattern \p{Lu}.
Now, the post on StackOverflow points out that using Unicode blocks to identify the script of a character is wrong because some scripts are spread over several blocks and some blocks contain characters from another script. That means that using \p{InGreek} will also match the 18 characters that are in the Greek block but are not actually Greek characters. Yikes.
A reliable way to identify the script is to use the Unicode 6.0 "script" property. The problem is that Java 6 supports only Unicode 4.0, not Unicode 6.0, and therefore does not expose the script property for Unicode characters. The forthcoming Java 7 is supposed to have full Unicode 6.0 support.
As a work-around, the StackOverflow poster provides links to some Perl programs which can be used to query the Unicode 6.0 character database, including the script property. With Perl (and Unicode 6.0) one can use a regular expression '\p{Script=Greek}' to match all Greek characters.
So, if we want to use Java to determine to which script a character belongs, e.g. is it Greek, Arabic, Cyrillic, Latin, etc., we cannot get an entirely accurate answer with Java 6.
A potential work-around might be to use the Perl scripts to generate tables of Unicode code-points for each script of interest (Greek, Arabic, etc.), then code up our own Java routines to use those tables to identify script membership for characters.
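The Java side of such a table-driven lookup could be as simple as the sketch below. The class name ScriptTable is hypothetical, and the two tables are tiny hand-picked excerpts for illustration only; a real table would be generated from the Unicode character database and contain many more ranges.

```java
import java.util.Arrays;

final class ScriptTable {
    private final int[] starts; // first code point of each range, ascending
    private final int[] ends;   // last code point of each range (inclusive)

    /** Builds a table from non-overlapping {start, end} pairs sorted by start. */
    ScriptTable(int[][] ranges) {
        starts = new int[ranges.length];
        ends = new int[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            starts[i] = ranges[i][0];
            ends[i] = ranges[i][1];
        }
    }

    // Illustrative excerpts only, NOT complete script tables.
    static final ScriptTable GREEK_EXCERPT =
        new ScriptTable(new int[][] {{0x0391, 0x03A1}, {0x03A3, 0x03C9}});
    static final ScriptTable CYRILLIC_EXCERPT =
        new ScriptTable(new int[][] {{0x0410, 0x044F}});

    /** Returns true if the code point falls inside one of the table's ranges. */
    boolean contains(int codePoint) {
        int i = Arrays.binarySearch(starts, codePoint);
        if (i >= 0) {
            return true; // exactly the first code point of a range
        }
        int candidate = -i - 2; // last range that starts before the code point
        return candidate >= 0 && codePoint <= ends[candidate];
    }
}
```

With a table like this, a character such as U+03E2, a Coptic letter that sits in the Greek and Coptic block, is correctly reported as not Greek, which a block-based test cannot do.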